Coding for DS and DM
R coding module

Lecture 4

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

Loop functionals: apply

  • The apply() function is used to evaluate a function over the “margins” of an array.
  • It is most often used to apply a function to the rows or columns of a matrix or data.frame.
M <- matrix(1:20, ncol = 5, nrow = 4, byrow = TRUE)
print(apply(M, MARGIN = 2, FUN = mean)) ## Compute the mean of each column
[1]  8.5  9.5 10.5 11.5 12.5
print(apply(M, MARGIN = 1, FUN = mean)) ## Compute the mean of each row
[1]  3  8 13 18
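
As a small sketch of the same idea, FUN can also be an anonymous function, and for means the result can be cross-checked against the dedicated base R helpers colMeans() and rowMeans(). The object names col_ranges, col_means and row_means are illustrative only.

```r
M <- matrix(1:20, ncol = 5, nrow = 4, byrow = TRUE)

## FUN can be any function of a vector, including an anonymous one:
## here, the spread (max - min) of each column
col_ranges <- apply(M, MARGIN = 2, FUN = function(v) max(v) - min(v))
print(col_ranges)

## For means, colMeans() and rowMeans() give the same values as apply()
col_means <- apply(M, MARGIN = 2, FUN = mean)
row_means <- apply(M, MARGIN = 1, FUN = mean)
print(all.equal(col_means, colMeans(M)))
print(all.equal(row_means, rowMeans(M)))
```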

Loop functionals: tapply

  • The tapply() function applies a function to each group of values defined by a factor.
x <- c(rnorm(10), runif(10), rnorm(10, 1)) 
f <- gl(3, 10)
tapply(X = x, INDEX = f, FUN = mean)
        1         2         3 
0.1109914 0.5669564 1.0563289 
  • Also useful in dataframes that contain a factor
tapply(X = iris$Sepal.Length, INDEX = iris$Species, FUN = mean)
    setosa versicolor  virginica 
     5.006      5.936      6.588 
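
To make the grouping step explicit, here is a minimal sketch that computes the same kind of group summary twice: once with tapply() and once "by hand" with split() and sapply(). It uses the built-in mtcars dataset; the names mpg_by_cyl and mpg_by_cyl2 are illustrative.

```r
## cyl is numeric in mtcars; tapply() treats its distinct values as groups
mpg_by_cyl <- tapply(X = mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)
print(mpg_by_cyl)

## Equivalent two-step version: split the values by group, then summarise each piece
mpg_groups  <- split(mtcars$mpg, mtcars$cyl)
mpg_by_cyl2 <- sapply(mpg_groups, mean)
print(mpg_by_cyl2)
```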

User-defined functions (1)

  • Abstracting code into many small functions is key for writing nice R code.
  • Functions are defined by code with a specific format.
function_name <- function(arg1, arg2, arg3 = NULL, ...) {
  ## code here...
  return(...)  
}

User-defined functions (2)

  • We have the following “ingredients” in a user-defined function:
    • function_name: the name of the function (case sensitive);
    • arg1, arg2, arg3, …: input values;
    • arg3 = NULL: default value. If arg3 is not provided when calling the function, NULL will be used;
    • return(): the output value (not mandatory: without it, the function returns the value of the last evaluated expression).
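
The ingredients listed above can be seen together in a short sketch. The function scale_vec is hypothetical and not part of the lecture code; it only illustrates a default argument, the ... mechanism, and a return value without an explicit return().

```r
## Hypothetical helper: optionally centre a vector, then divide by its standard deviation
scale_vec <- function(x, center = TRUE, ...) {
  if (center) {
    x <- x - mean(x, ...)   ## extra arguments in ... are forwarded to mean()
  }
  x / sd(x, ...)            ## last evaluated expression: this is the returned value
}

res_default <- scale_vec(c(1, 2, 3))                    ## default center = TRUE is used
res_nocentr <- scale_vec(c(1, 2, 3), center = FALSE)    ## default overridden
res_na      <- scale_vec(c(1, 2, NA, 3), na.rm = TRUE)  ## na.rm travels through ...
print(res_default)
print(res_nocentr)
print(res_na)
```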

User-defined functions: example (3)

  • How to define and call a function.
  • Here is a function to compute the sum of the first n positive integers:
sum_int <- function(n) {
  s <- sum(seq_len(n))  ## seq_len(0) is empty, whereas 1:0 would give c(1, 0)
  return(s)
}
sum_int(n = 100)
[1] 5050
  • Note that a function must be defined before it is called.

User-defined functions: example (4)

  • Define a function to compute the p-norm of a vector x. By default, compute the Euclidean norm (p = 2).
norm_p <- function(x, p = 2) {
  d <- sum(abs(x)^p)^(1/p)  ## abs() is needed for negative entries when p is odd
  return(d)
}
print(norm_p(x = c(1, 1)))      ## Compute the Euclidean norm of the vector c(1,1)
[1] 1.414214
print(norm_p(x = c(1, 1), p=3)) ## Compute the 3-norm of the vector c(1,1)
[1] 1.259921

Functions technicalities (1)

  • The function first creates a temporary local environment.
  • For a function defined at the top level, this environment is nested within the global environment, so from inside the function you can also access any object from the global environment (possible, but best avoided: pass values as arguments instead).
  • As soon as the function ends, the local environment is destroyed along with all the objects in it.
#define the function
test1 <- function(){
  test_string <- 'This object is destroyed as soon as the function ends!'
  cat(test_string)
}
test1()
This object is destroyed as soon as the function ends!

Functions technicalities (2)

  • When R encounters an object name inside a function, it first searches the local environment.
  • If it finds the object there, it uses that one; otherwise it searches the enclosing environment (here, the global environment).
## global variable i
i <- 1
## define function
test2 <- function(){
  ## local variable i
  ## there is no i in the local environment -> search in parent environment
  i <- i * 10
  ## return
  return(i)
}

Functions technicalities (3)

## run function
test2()
[1] 10
  • Global i has not changed!
i
[1] 1
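
A minimal sketch of the recommended alternative: instead of reading i from the global environment, pass it in as an argument. The names test3 and make_local are hypothetical and used only for illustration.

```r
i <- 1

## The value is passed in explicitly, so the function does not depend
## on whatever happens to exist in the global environment
test3 <- function(i) {
  i <- i * 10   ## modifies only the local copy
  i
}

print(test3(i))   ## result computed from the argument
print(i)          ## the global i is unchanged

## Objects created inside a function disappear when it returns
make_local <- function() {
  local_obj <- "only visible inside make_local()"
  invisible(NULL)
}
make_local()
## FALSE, assuming no object called local_obj was defined globally
print(exists("local_obj"))
```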

User-defined functions (1)

  • A good practice in R is to collect all user-written functions in a dedicated script (one per project) and load it with source(). This way, functions can be easily reused in multiple scripts.

  • Even better, even if it requires a little bit more work at the beginning, is to set up an R package. A very good reference on how to do it is here
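
The source() workflow described above can be sketched as follows. A temporary file stands in for the functions script only to keep the example self-contained; in a real project it would be a regular file in the project folder. The function sum_sq is hypothetical.

```r
## Step 1: the functions script (written programmatically here; normally edited by hand)
fun_file <- tempfile(fileext = ".R")
writeLines(
  c("sum_sq <- function(x) {",
    "  sum(x^2)",
    "}"),
  con = fun_file
)

## Step 2: in any analysis script, load the definitions, then use them
source(fun_file)
print(sum_sq(1:3))
```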

User-defined functions (2)

  • Suppose you want to get the multiple linear regression coefficient estimates:

\[ \hat{\beta} = (X^T X)^{-1}X^T y \]

(Refer to regression courses for this).

Then, you can save the code that computes this in a function stored in a file called my_regr_coeff.R and use source("my_regr_coeff.R") to load it into the environment.

User-defined functions (3)

The function might be the following

my_rc <- function(X, Y, add_intercept = TRUE) {
  ## as.matrix() turns a vector into a one-column matrix,
  ## so X can be either a vector or a matrix
  X <- as.matrix(X)
  if (add_intercept) {
    X <- cbind(1, X)  ## prepend a column of ones for the intercept
  }
  ## t(X) returns the transpose of X
  ## solve(A) returns the inverse of A
  b <- as.vector(solve(t(X) %*% X) %*% t(X) %*% Y)
  ## name the coefficients beta_0, beta_1, ... (beta_1, ... without intercept)
  first <- if (add_intercept) 0 else 1
  names(b) <- paste0("beta_", seq(from = first, length.out = length(b)))
  return(as.list(b))
}

User-defined functions (4)

  • Then, by using the following code:
source("my_regr_coeff.R")

my_fit <- my_rc(X, Y)
  • The estimates are stored in my_fit; here X is a matrix of predictors and Y is the response vector.

User-defined functions (5)

  • Suppose you want to compute the coefficient estimates \(\hat{\beta}\) with X = height and Y = weights for a group of people.

  • We use this code:

height <- c(160, 172, 175, 168, 170, 171, 169, 165, 165, 160, 180, 186, 190, 170)
weights <- c(55, 67, 80, 68, 72, 75, 70, 65, 62, 60, 85, 90, 92, 71)
source("my_regr_coeff.R")
fit <- my_rc(X = height, Y = weights)
fit
$beta_0
[1] -136.6742

$beta_1
[1] 1.218425
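
As a sanity check, the closed-form estimate can be compared with base R's lm(). The sketch below recomputes \(\hat{\beta}\) directly from the formula rather than calling my_rc(), so that it runs on its own; the names beta_hat and ref are illustrative, and the tolerance is an arbitrary choice that allows for rounding error in the explicit matrix inverse.

```r
height  <- c(160, 172, 175, 168, 170, 171, 169, 165, 165, 160, 180, 186, 190, 170)
weights <- c(55, 67, 80, 68, 72, 75, 70, 65, 62, 60, 85, 90, 92, 71)

## Closed-form least squares estimate with an explicit intercept column
X <- cbind(1, height)
beta_hat <- as.vector(solve(t(X) %*% X) %*% t(X) %*% weights)

## Reference fit from base R
ref <- unname(coef(lm(weights ~ height)))

print(beta_hat)
print(ref)
print(all.equal(beta_hat, ref, tolerance = 1e-6))
```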

External data management in R: CSV (1)

  • A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.
  • A CSV file stores tabular data (numbers and text) in plain text.
  • Each line of the file is a data record. Each record consists of one or more fields, separated by the delimiter.
  • CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications.
  • R makes it easy to export and import data in CSV format.

External data management in R: CSV export (2)

  • Many packages contain datasets that can be exported into CSV files locally.
data("mtcars")                          ## load the mtcars dataset
write.csv(mtcars, file = 'my_mtcars.csv')  ## export to file

External data management in R: CSV import (3)

  • The easiest way to read a CSV file is through read.csv().
  • It automatically returns a data frame:
x <- read.csv('my_mtcars.csv')             ## read file
head(x, n = 3)
              X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
  • There are many other (possibly better) options; see, e.g., the readr and data.table packages.
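
In the output above, the car names end up in a regular column called X, because write.csv() stores the row names as the first, unnamed field. A minimal sketch of one way to restore them on import, using the row.names argument of read.csv(); a temporary file is used so that the example does not touch the working directory.

```r
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, file = csv_file)

## Default import: the row names become an ordinary column named X
x1 <- read.csv(csv_file)
print(names(x1)[1])

## row.names = 1 uses the first column as row names instead
x2 <- read.csv(csv_file, row.names = 1)
print(head(rownames(x2), n = 3))
print(dim(x2))
```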

External data management in R: CSV from the net (4)

covid_daily_report <- read.csv(file = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/jhu/full_data.csv")
head(covid_daily_report)
        date    location new_cases new_deaths total_cases total_deaths
1 2020-02-24 Afghanistan         5         NA           5           NA
2 2020-02-25 Afghanistan         0         NA           5           NA
3 2020-02-26 Afghanistan         0         NA           5           NA
4 2020-02-27 Afghanistan         0         NA           5           NA
5 2020-02-28 Afghanistan         0         NA           5           NA
6 2020-02-29 Afghanistan         0         NA           5           NA
  weekly_cases weekly_deaths biweekly_cases biweekly_deaths
1           NA            NA             NA              NA
2           NA            NA             NA              NA
3           NA            NA             NA              NA
4           NA            NA             NA              NA
5           NA            NA             NA              NA
6            5            NA             NA              NA

External data management in R: many types (5)

  • Other file formats may contain data that you wish to load in R
  • Common formats include txt, tsv, xlsx, SAV (SPSS), Stata, mat, RData, and rds.
  • The Import Dataset Addin in RStudio is a good starting point.
  • An exhaustive list would be both boring and useless, as you will end up looking up how to do it online anyway.
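
For two of the formats listed above, base R is enough. The sketch below round-trips the built-in iris dataset through a tab-separated file and an rds file; temporary files are used to keep it self-contained, and the object names are illustrative. The remark on character columns assumes R 4.0 or later, where stringsAsFactors defaults to FALSE.

```r
## Tab-separated text: write.table() to export, read.delim() to import
tsv_file <- tempfile(fileext = ".tsv")
write.table(iris, file = tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)
iris_tsv <- read.delim(tsv_file)

## rds: stores a single R object, including its types (e.g. factors)
rds_file <- tempfile(fileext = ".rds")
saveRDS(iris, file = rds_file)
iris_rds <- readRDS(rds_file)

print(class(iris_tsv$Species))   ## text files do not store factor information
print(class(iris_rds$Species))   ## the factor survives the rds round trip
print(identical(iris_rds, iris))
```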

External data management in R: pkg (6)

  • The quantmod package provides a convenient function for downloading financial data from the web.
  • The main function is called getSymbols. The function works with a variety of sources.

External data management in R: quantmod (7)

  • For currencies, the oanda source is used.
library(quantmod)
x <- getSymbols(Symbols = 'EUR/USD', src = 'oanda', auto.assign = FALSE)   
tail(x)
           EUR.USD
2024-09-28 1.11640
2024-09-29 1.11644
2024-09-30 1.11610
2024-10-01 1.10956
2024-10-02 1.10581
2024-10-03 1.10334

External data management in R: quantmod (8)

  • For economic series, the FRED source is used.
  • This example retrieves the Japanese GDP:
## retrieve the historical Gross Domestic Product for Japan
x <- getSymbols(Symbols = 'JPNNGDP', src = 'FRED', auto.assign = FALSE)   
tail(x)
            JPNNGDP
2023-01-01 583288.0
2023-04-01 594937.2
2023-07-01 594562.4
2023-10-01 598615.8
2024-01-01 597132.0
2024-04-01 607903.7